Re-scale boosting for regression and classification

Shaobo Lin, Yao Wang , and Lin Xu 


Abstract —Boosting is a learning scheme that combines 
weak prediction rules to produce a strong composite 
estimator, with the underlying intuition that one can obtain 
accurate prediction rules by combining “rough” ones. Although
boosting has been proved to be consistent and overfitting-resistant,
its numerical convergence rate is relatively slow.
The aim of this paper is to develop a new boosting strategy, 
called the re-scale boosting (RBoosting), to accelerate the 
numerical convergence rate and, consequently, improve the 
learning performance of boosting. Our studies show that 
RBoosting possesses the almost optimal numerical convergence
rate in the sense that, up to a logarithmic factor,
it can reach the minimax nonlinear approximation rate. 
We then use RBoosting to tackle both the classification 
and regression problems, and deduce a tight generalization 
error estimate. The theoretical and experimental results 
show that RBoosting outperforms boosting in terms of 
generalization. 

Index Terms —Boosting, re-scale boosting, numerical 
convergence rate, generalization error 

I. Introduction 

Contemporary scientific investigations frequently encounter
the common issue of exploring the relationship
between a response and a number of covariates. Statistically,
this issue can usually be modeled as minimizing
either an empirical loss function or a penalized empirical 
loss. Boosting is recognized as a state-of-the-art scheme 
to attack this issue and has triggered enormous research 
activities in the past twenty years fTTH . lfT5l . lfl8ll . l26l . 

Boosting is an iterative procedure that combines weak 
prediction rules to produce a strong composite learner, 
with the underlying intuition that one can obtain accurate 
prediction rules by combining “rough” ones. The gradient
descent view [18] of boosting shows that it can be
regarded as a step-wise fitting scheme of additive models.
This statistical viewpoint connects various boosting
algorithms to optimization problems with corresponding
loss functions. For example, L2 boosting [7] can be
interpreted as a stepwise learning scheme for the L2 risk
minimization problem. Also, AdaBoost [16] corresponds
to an approximate optimization of the exponential risk.

Although the success of the initial boosting algorithm
(Algorithm 1 below) on many data sets and its “resistance
to overfitting” were comprehensively demonstrated
0, [16], its numerical
convergence rate is usually rather slow [24]. In fact,
Livshits [24] proved that for some sparse target functions,
the numerical convergence rate of boosting lies
in $(C_0 k^{-0.1898}, C_0' k^{-0.182})$, which is much slower than
the minimax nonlinear approximation rate $\mathcal{O}(k^{-1/2})$.
Here and hereafter, $k$ denotes the number of iterations,
and $C_0$, $C_0'$ are absolute constants. Various modified
versions of boosting have been proposed to accelerate
its numerical convergence rate and thereby improve its
generalization capability. Typical examples include the
regularized boosting via shrinkage (RSboosting) [12]
that multiplies the step-size deduced from the linear search
by a small regularization factor, regularized boosting
via truncation (RTboosting) [34] which truncates the
linear search to a small interval, and ε-boosting [20] that
specifies the step-size as a fixed small positive number
ε rather than using the linear search.

The purpose of the present paper is to propose a
new modification of boosting to accelerate the numerical
convergence rate of boosting to the near optimal rate
$\mathcal{O}(k^{-1/2} \log k)$. The new variant of boosting, called
the re-scale boosting (RBoosting), follows the philosophy
of “no pain, no gain”, that is, to derive
the new estimator, we always apply a shrinkage operator
to re-scale the old one. This idea is similar to the
“greedy algorithm with free relaxation” [30] or the “sequential
greedy algorithm” [33] in sparse approximation
and is essentially different from Zhao and Yu’s Blasso
[35], since the shrinkage operator is imposed on the
composite estimator rather than on the newly selected weak
learner. With the help of the shrinkage operator, we can
derive different types of RBoosting such as the re-scale
AdaBoost, re-scale LogitBoost, and re-scale L2 boosting
for regression and classification.

We present both theoretical analysis and experimental
verification to clarify the performance of RBoosting
with convex loss functions. The main contributions can
be summarized in four aspects. First, we deduce the
(near) optimal numerical convergence rate of RBoosting.
Our result shows that RBoosting can improve the numerical
convergence rate of boosting to the (near) optimal
rate. Secondly, we derive the generalization error
bound of RBoosting. It is shown that the generalization
capability of RBoosting is essentially better than
that of boosting. Thirdly, we deduce the consistency
of RBoosting. The consistency of boosting has already been
justified in 0 for AdaBoost. The novelty of our result is
that the consistency of RBoosting is built upon relaxing
the restrictions on the dictionary and providing a more
flexible choice of the iteration number. Finally, we experimentally
compare RBoosting with boosting, RTboosting,
RSboosting and ε-boosting in both regression and
classification problems. Simulation results demonstrate
that, similar to other modified versions of boosting,
RBoosting outperforms boosting in terms of prediction
accuracy.

The rest of the paper is organized as follows. In
Section 2, we introduce RBoosting and compare it with
other related algorithms. In Section 3, we study the
theoretical behaviors of RBoosting, where its numerical
convergence, consistency and generalization error
bound are derived. In Section 4, we employ a series of 
simulations to verify our assertions. In the last section, 
we draw a simple conclusion and present some further 
discussions. 


II. Re-scale boosting 

In classification or regression problems with a covariate
or predictor variable $X$ on $\mathcal{X} \subseteq \mathbb{R}^d$ and a
real response variable $Y$, we observe $m$ i.i.d. samples
$Z^m = \{(X_1, Y_1), \dots, (X_m, Y_m)\}$ from an unknown
distribution $D$. Consider a loss function $\phi(f, y)$ and
define $Q(f)$ (true risk) and $Q_m(f)$ (empirical risk) as

$$Q(f) = E_D \phi(f(X), Y),$$

and

$$Q_m(f) = E_Z \phi(f(X), Y) = \frac{1}{m} \sum_{i=1}^{m} \phi(f(X_i), Y_i),$$

where $E_D$ is the expectation over the unknown true
joint distribution $D$ of $(X, Y)$ and $E_Z$ is the empirical
expectation based on the sample $Z^m$.
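
As a minimal illustration of the empirical risk $Q_m(f)$, the following Python sketch averages a loss over the sample. The three loss functions are the usual textbook forms of the L2, exponential and logistic losses (with labels in {-1, +1} for classification); these explicit forms and all function names are assumptions made for illustration and are not taken from the text above.

```python
import numpy as np

def l2_loss(f, y):
    """Squared loss (f - y)^2."""
    return (f - y) ** 2

def exp_loss(f, y):
    """Exponential loss exp(-y f); labels y in {-1, +1} (assumed form)."""
    return np.exp(-y * f)

def logistic_loss(f, y):
    """Logistic loss log(1 + exp(-y f)); labels y in {-1, +1} (assumed form)."""
    return np.log1p(np.exp(-y * f))

def empirical_risk(loss, f_values, y):
    """Q_m(f) = (1/m) * sum_i loss(f(X_i), Y_i), with f already evaluated at the samples."""
    f_values = np.asarray(f_values, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(loss(f_values, y)))
```

Passing the values $f(X_i)$ rather than a callable keeps the sketch independent of how the estimator is represented.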

Let $S = \{g_1, \dots, g_n\}$ be the set of weak learners
(classifiers or regressors) and define

$$\mathrm{Span}(S) = \left\{ \sum_{j=1}^{n} a_j g_j : g_j \in S,\ a_j \in \mathbb{R},\ n \in \mathbb{N} \right\}.$$

We assume that $\phi$, and therefore $Q_m$, is Fréchet differentiable
and denote by $Q'_m(f, h) = \langle \nabla Q_m(f), h \rangle$ the value of the
linear functional $\nabla Q_m(f)$ at $h$, where $\nabla Q_m(f)$ satisfies,
for all $f, h \in \mathrm{Span}(S)$,

$$\lim_{t \to 0} \frac{1}{t} \left( Q_m(f + th) - Q_m(f) \right) = \langle \nabla Q_m(f), h \rangle.$$

Then the gradient descent view of boosting [18] can
be interpreted as the following Algorithm 1.


Algorithm 1 Boosting

Step 1 (Initialization): Given data $\{(X_i, Y_i) : i = 1, \dots, m\}$,
weak learner set (or dictionary) $S$, iteration
number $k^*$, and $f_0 \in \mathrm{Span}(S)$.

Step 2 (Projection of gradient): Find $g_k^* \in S$ such that

$$-Q'_m(f_{k-1}, g_k^*) = \sup_{g \in S} -Q'_m(f_{k-1}, g).$$

Step 3 (Linear search): Find $\beta_k^* \in \mathbb{R}$ such that

$$Q_m(f_k + \beta_k^* g_k^*) = \inf_{\beta_k \in \mathbb{R}} Q_m(f_k + \beta_k g_k^*).$$

Update $f_{k+1} = f_k + \beta_k^* g_k^*$.

Step 4 (Iteration): Increase $k$ by one and repeat Step
2 and Step 3 if $k < k^*$.
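
The steps of Algorithm 1 can be sketched for one concrete case. The Python code below assumes the L2 loss, a finite dictionary stored as a matrix `G` whose columns are the weak learners evaluated at the sample points, nonzero columns, and the initial estimator $f_0 = 0$. The function name `l2_boosting` and the matrix representation are illustrative assumptions, not the authors' implementation. For the L2 loss, $-Q'_m(f, g) = \frac{2}{m} \sum_i (Y_i - f(X_i)) g(X_i)$ and the linear search has a closed form.

```python
import numpy as np

def l2_boosting(G, y, n_iter):
    """Sketch of Algorithm 1 for the L2 loss.

    G : (m, n) array, column j holds weak learner g_j at the m sample points.
    y : (m,) array of responses.
    Returns the coefficients, the fitted values and the empirical risk per iteration.
    """
    m, n = G.shape
    f = np.zeros(m)        # f_0 = 0 (assumption)
    coef = np.zeros(n)
    risks = []
    for _ in range(n_iter):
        residual = y - f
        # Step 2: -Q'_m(f, g_j) = (2/m) <y - f, g_j>; pick the maximizer over S.
        neg_deriv = 2.0 / m * (G.T @ residual)
        j = int(np.argmax(neg_deriv))
        g = G[:, j]
        # Step 3: exact line search, beta* = <y - f, g> / <g, g>.
        beta = (residual @ g) / (g @ g)
        f = f + beta * g
        coef[j] += beta
        risks.append(float(np.mean((f - y) ** 2)))
    return coef, f, risks
```

Because the line search is exact, the empirical risk in this sketch never increases from one iteration to the next.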


Although this original boosting algorithm was proved
to be consistent 0 and overfitting resistant [17], a
series of studies 0, [24], [29] showed that its numerical
convergence rate is far slower than that of the best
nonlinear approximant. The main reason is that the linear
search in Algorithm 1 does not always make $f_k$ the
greediest choice. In particular, as shown in Fig. 1, if $f_{k-1}$
walks along the direction of $g_k$ to $\theta_0 g_k$, then there usually
exists a weak learner $g$ such that the angle $\alpha = \beta$. That
is, after $\theta_0 g_k$, continuing to walk along $g_k$ is no longer the
greediest choice. However, the linear search makes $f_{k-1}$ go
along the direction of $g_k$ to $\theta_1 g_k$.



Under this circumstance, an advisable method is to
control the step-size in the linear search step of Algorithm 1.
Thus, various variants of boosting, comprising
the RTboosting, RSboosting and ε-boosting, have been
developed based on different strategies to control the
step-size. The main difficulty of these
schemes lies in how to select an appropriate step-size. If
the step-size is too large, then these algorithms may face
the same problem as Algorithm 1. On the other
hand, if the step size is too small, then the numerical 
convergence rate is also fairly slow QQ. 

Different from the aforementioned strategies that focus
on controlling the step-size of $g_k^*$, we pursue a novel
direction to improve the numerical convergence rate and,
consequently, the generalization capability of boosting.
The core idea is that if the approximation (or learning)
effect of the $k$-th iteration is not good, then we regard
$f_k$ as too aggressive and therefore shrink it to a
certain extent. That is, whenever a new iteration is employed,
we impose a re-scale operator on the estimator $f_k$.
This is the reason why we call our new strategy
re-scale boosting (RBoosting). The following Algorithm
2 depicts the main idea of RBoosting.


Algorithm 2 Re-scale boosting

Step 1 (Initialization): Given data $\{(X_i, Y_i) : i = 1, \dots, m\}$,
weak learner set $S$, a set of shrinkage degrees
$\{\alpha_k\}_{k \ge 1}$, iteration number $k^*$, and $f_0 \in \mathrm{Span}(S)$.

Step 2 (Projection of gradient): Find $g_k^* \in S$ such that

$$-Q'_m(f_{k-1}, g_k^*) = \sup_{g \in S} -Q'_m(f_{k-1}, g).$$

Step 3 (Linear search): Find $\beta_k^* \in \mathbb{R}$ such that

$$Q_m((1 - \alpha_k) f_k + \beta_k^* g_k^*) = \inf_{\beta_k \in \mathbb{R}} Q_m((1 - \alpha_k) f_k + \beta_k g_k^*).$$

Update $f_{k+1} = (1 - \alpha_k) f_k + \beta_k^* g_k^*$.

Step 4 (Iteration): Increase $k$ by one and repeat Step
2 and Step 3 if $k < k^*$.
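
A minimal sketch of Algorithm 2 for the L2 loss is given below, under the same illustrative assumptions as before: a finite dictionary stored column-wise in a matrix `G` with nonzero columns, and $f_0 = 0$. The shrinkage schedule $\alpha_k = 2/(k + u)$ is the one used in the numerical section of the paper; the default value `u = 2.0` and the function name `rescale_l2_boosting` are assumptions made only for this example.

```python
import numpy as np

def rescale_l2_boosting(G, y, n_iter, u=2.0):
    """Sketch of re-scale boosting (Algorithm 2) for the L2 loss.

    G : (m, n) array, column j holds weak learner g_j at the m sample points.
    y : (m,) array of responses.
    u : parameter of the shrinkage degree alpha_k = 2 / (k + u) (default is an assumption).
    """
    m, n = G.shape
    f = np.zeros(m)        # f_0 = 0 (assumption)
    coef = np.zeros(n)
    risks = []
    for k in range(1, n_iter + 1):
        alpha = 2.0 / (k + u)
        # Step 2: projection of gradient at the current estimator.
        neg_deriv = 2.0 / m * (G.T @ (y - f))
        j = int(np.argmax(neg_deriv))
        g = G[:, j]
        # Step 3: re-scale the old estimator, then do an exact line search along g.
        shrunk = (1.0 - alpha) * f
        beta = ((y - shrunk) @ g) / (g @ g)
        f = shrunk + beta * g
        # Keep the coefficient vector consistent with f = G @ coef.
        coef *= (1.0 - alpha)
        coef[j] += beta
        risks.append(float(np.mean((f - y) ** 2)))
    return coef, f, risks
```

The only change relative to a plain boosting loop is the multiplication of the previous estimator (and its coefficients) by $1 - \alpha_k$ before the line search.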

Comparing Algorithm 2 with Algorithm 1, the only
difference is that we employ a re-scale operator
$(1 - \alpha_k) f_k$ in the linear search step of RBoosting. Here and
hereafter, we call $\alpha_k$ the shrinkage degree. It can
be easily seen that RBoosting is similar to the greedy
algorithm with free relaxation (GAFR) [30] and the X-greedy
algorithm with relaxation (XGAR)¹ [28], [33] in
sparse approximation. In fact, RBoosting can be regarded
as a marriage of GAFR and XGAR. In detail, we
adopt the projection of gradient of GAFR and the linear
search of XGAR to develop Algorithm 2.

¹In [33], XGAR was called the sequential greedy algorithm,
while in [3], XGAR was named the relaxed greedy algorithm for
brevity.


It should also be pointed out that the present paper is
not the first one to apply relaxed greedy-type algorithms
in the realm of boosting. In particular, for the L2 loss,
XGAR has already been utilized to design a boosting-type
algorithm for regression in [1]. Since in both GAFR
and XGAR one needs to tune two parameters simultaneously
in an optimization problem, GAFR and XGAR
are time-consuming when faced with a general convex
loss function. This problem is successfully avoided in
RBoosting.

III. Theoretical behaviors of RBoosting 

In this section, we study the theoretical behaviors of
RBoosting. We address three basic issues regarding
RBoosting: its numerical convergence
rate, consistency and generalization error estimate.

To state the main results, some assumptions concerning
the loss function $\phi$ and dictionary $S$ should be
presented. The first one is a boundedness assumption
on $S$.

Assumption 1: For arbitrary $g \in S$ and $x \in \mathcal{X}$, there
exists a constant $C_1$ such that

n 

5 Z9i ( x ) < Ci- 

i=l 

Assumption 1 is certainly a bit stricter than the assumption
$\sup_{g_i \in S, x \in \mathcal{X}} |g_i(x)| \le 1$ in [30], [34]. Introducing
such a condition is only for the purpose of deriving
a fast numerical convergence rate of RBoosting with
general convex loss functions. In fact, for a concrete
loss function such as the $L_p$ loss with $1 < p < \infty$,
Assumption 1 can be relaxed to $\sup_{g_i \in S, x \in \mathcal{X}} |g_i(x)| \le 1$
[28]. Assumption 1 essentially depicts the localization
properties of the weak learners. Indeed, it states that,
for arbitrary fixed $x \in \mathcal{X}$, except for a small number
of weak learners, all the $|g_i(x)|$'s are very small. Thus,
it holds for almost all the widely used weak learners
such as trees [18], stumps [34], neural networks
dll and splines Q. Moreover, for arbitrary dictionary
S’ = {g [,..., g' n }, we can rebu ild it as S = {g 1: ...,g n } 

with gi = g[/(\JY a =1 ) 2 ( x ))• It should be noted that 
Assumption 1 is the only condition concerning the dictionary
throughout the paper, which is different from [3],
[34] that additionally imposed either VC-dimension or
Rademacher complexity constraints on the weak learner
set $S$.

We then give some restrictions on the loss function,
which have already been adopted in Q, [4], [33], [34].

Assumption 2: (i) If $|f(x)| \le \mathcal{R}_1$, $|y| \le \mathcal{R}_2$, then
there exists a continuous function $H_\phi$ such that

$$|\phi(f, y)| \le H_\phi(\mathcal{R}_1, \mathcal{R}_2). \quad (1)$$

(ii) Let $V = \{f : Q_m(f) \le Q_m(0)\}$ and

f* = min fevQm(f)- Assume that Vci,C 2 satisfying 
QmU*) < Cl < c 2 < Qm(0), there holds 

0 < inf {Q' m (f,g) ■■ ci < Q(f) <c 2 ,g € 5} (2) 
< sup {Qm(f,g) ■ Qm{f ) < c 2 ,/i G S'} < oo.(3) 

It should be pointed out that (i) concerns the boundedness
of $\phi$ and therefore is mild. In fact, if $\mathcal{R}_1$ and $\mathcal{R}_2$ are
bounded, then (i) implies that $\phi(f, y)$ is also bounded. It
is obvious that (i) holds for almost all commonly used
loss functions. Once $\phi$ is given, $H_\phi(\mathcal{R}_1, \mathcal{R}_2)$ can be
determined directly. For example, if $\phi$ is the L2 loss
for regression, then $H_\phi(\mathcal{R}_1, \mathcal{R}_2) \le (\mathcal{R}_1 + \mathcal{R}_2)^2$; if $\phi$ is
the exponential loss for classification, then $\mathcal{R}_2 = 1$ and
$H_\phi(\mathcal{R}_1, \mathcal{R}_2) \le \exp\{\mathcal{R}_1\}$; if $\phi$ is the logistic loss for
classification, then $H_\phi(\mathcal{R}_1, \mathcal{R}_2) \le \log(1 + \exp\{\mathcal{R}_1\})$.

As $Q_m(f) = \frac{1}{m} \sum_{i=1}^{m} \phi(f(x_i), Y_i)$, conditions (2) and
(3) actually describe the strict convexity and smoothness
of $\phi$ as well as $Q_m$. Condition (2) guarantees the
strict convexity of $Q_m$ in a certain direction. Under
this condition, the maximization (and minimization) in the
projection of gradient step (and linear search step) of
Algorithms 1 and 2 is well defined. Condition (3) determines
the smoothness property of $Q_m(f)$. For arbitrary
$f(x) \in [-A, A]$, define the first and second moduli of
smoothness of $Q_m(f)$ as

$$\rho_1(Q_m, u) = \sup_{f, \|h\| = 1} |Q_m(f + uh) - Q_m(f)|,$$

and

$$\rho_2(Q_m, u) = \sup_{f, \|h\| = 1} |Q_m(f + uh) + Q_m(f - uh) - 2 Q_m(f)|,$$

where $\| \cdot \|$ denotes the uniform norm. It is easy to deduce
that if (3) holds, then there exist constants $C_2$ and $C_3$
depending only on $A$ and $c_2$ such that

$$\rho_1(Q_m, u) \le C_2 \|u\|, \quad \text{and} \quad \rho_2(Q_m, u) \le C_3 \|u\|^2. \quad (4)$$

It is easy to verify that all the widely used loss functions 
such as the L 2 loss, exponential loss and logistic loss 
satisfy Assumption 2. 

With the help of the above statements, we are in a position
to present the first theorem, which focuses on the
numerical convergence rate of RBoosting.

Theorem 1: Let fk be the estimator defined by Algo¬ 
rithm [2] If Assumptions Q] and [2] hold and a 4 . = 
then for any h € Span(S), there holds 

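$$Q_m(f_k) - Q_m(h) \le C(\|h\|_1^2 + \log k) k^{-1}, \quad (5)$$

where $C$ is a constant depending only on $c_1$, $c_2$, $C_1$, and

$$\|h\|_1 = \inf_{(a_j)_{j=1}^{n} \in \mathbb{R}^n} \sum_{j=1}^{n} |a_j| \quad \text{for } h = \sum_{j=1}^{n} a_j g_j.$$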


If $\phi(f, y) = (f(x) - y)^2$ and $S$ is an orthogonal basis,
then there exists an $h^* \in \mathrm{Span}(S)$ with bounded $\|h^*\|_1$
such that [9]

$$|Q_m(f_k) - Q_m(h^*)| \ge C k^{-1},$$

where $C$ is an absolute constant. Therefore, the numerical
convergence rate deduced in (5) is almost optimal in
the sense that for at least some loss functions (such as the
L2 loss) and certain dictionaries (such as the orthogonal
basis), up to a logarithmic factor, the deduced rate is
optimal. Compared with the relaxed greedy algorithm
for convex optimization [10], [30] that achieves the
optimal numerical convergence rate, the rate derived in
(5) seems a bit slower. However, in [10], [30], the set
$V = \{f : Q_m(f) \le Q_m(0)\}$ is assumed to be bounded.
This is a quite strict assumption and, to the best of our
knowledge, it is difficult to verify whether the widely
used L2 loss, exponential loss and logistic loss satisfy
this condition. In Theorem 1, we omit this condition at
the cost of adding an additional logarithmic factor to the
numerical convergence rate and some other easily checked
assumptions on the loss function and dictionary.

Finally, we give an explanation why we select the 
shrinkage degree ak as o:> = From the definition 
of fk, it follows that the numerical convergence rate may 
depend on the shrinkage degree. In particular, Bagirov 
et al. |jT[, Barron et al. © and Temlyakov Il28l used 
different o>. to derive the optimal numerical convergence 
rates of relaxed-type greedy algorithms. After checking 
our proof, we find that our result remains correct for 
arbitrary = Cs k+c e < * with C 4 , C-, , C), some finite 
positive integers. The only difference is that the constant 
C in © may be different for different «&. We select 
a:/,. = j-Kj is only for the sake of brevity. 

Now we turn to derive both the consistency and 
learning rate of RBoosting. The consistency of the 
boosting-type algorithms describes whether the risk of 
boosting can approximate the Bayes risk within arbitrary 
accuracy when m is large enough, while the learning rate 
depicts its convergence rate. Several authors have shown 
that Algorithm 1 with some specific loss functions is
consistent. Three most important results can be found in
[3], Pill, [22]. Jiang [22] proved a process consistency
property for Algorithm 1 under certain assumptions.
Process consistency means that there exists a sequence
$\{t_m\}$ such that if boosting with sample size $m$ is
stopped after $t_m$ iterations, its risk approaches the Bayes
risk. However, Jiang imposed strong conditions on the
underlying distribution: the distribution of $X$ has to
be absolutely continuous with respect to the Lebesgue
measure. Furthermore, the result derived in [22] did not
give any hint on when the algorithm should be stopped
since the proof was not constructive. [3], 0 improved
the result of [22] and demonstrated that a simple stopping
rule is sufficient for consistency: the number of iterations
is a fixed function of $m$. However, it can also be found
in [3], 0 that the deduced learning rate was fairly slow.
[3, Th. 6] showed that the risk of boosting converges to
the Bayes risk at a logarithmic rate.

Without loss of generality, we assume $|Y_i| \le M$
almost surely with $M > 0$. The following Theorem 2
plays a crucial role in deducing both the consistency and
fast learning rate of RBoosting.

Theorem 2: Let //. be the estimator obtained in Al¬ 
gorithm [2] If a k = -jrp 3 and Assumptions Q] and [2] hold, 
then for arbitrary h € Span(S'), there holds 

$$E\{Q(f_k) - Q(h)\} \le C(\|h\|_1^2 + \log k) k^{-1} + C' \left( H_\phi(\log k, M) + H_\phi(\|h\|_1, M) \right) \frac{k(\log m + \log k)}{m},$$

where $C$ and $C'$ are constants depending only on $c_1$, $c_2$
and $C_1$.

Before giving the consistency of RBoosting, we
should give some explanations and remarks on Theorem
2. Firstly, we present the values of $H_\phi(\log k, M)$ and
$H_\phi(\|h\|_1, M)$. Taking $H_\phi(\log k, M)$ for example, if $\phi$ is
the L2 loss for regression, then $H_\phi(\log k, M) = (\log k + M)^2$;
if $\phi$ is the logistic loss for classification, then
$H_\phi(\log k, M) = \log(k + 1)$; and if $\phi$ is the exponential
loss for classification, then $H_\phi(\log k, M) = k$. Secondly,
we provide a simple method to improve the bound in
Theorem 2. Let $\pi_M f(x) := \min\{M, |f(x)|\} \mathrm{sgn}(f(x))$
be the truncation operator at level $M$. As $Y \in [-M, M]$
almost surely, there holds [36]

$$E\{Q(\pi_M f_k) - Q(h)\} \le E\{Q(f_k) - Q(h)\}.$$
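
The truncation operator at level $M$ amounts to clipping the estimator's values to $[-M, M]$. The short Python sketch below implements it directly from the formula $\min\{M, |f(x)|\}\,\mathrm{sgn}(f(x))$; the function name `truncate` is a hypothetical choice. For the L2 loss, the pointwise fact behind the inequality is that clipping can only move a prediction closer to a response lying in $[-M, M]$.

```python
import numpy as np

def truncate(f_values, M):
    """Truncation at level M: min{M, |f(x)|} * sgn(f(x)), applied elementwise."""
    f_values = np.asarray(f_values, dtype=float)
    return np.minimum(M, np.abs(f_values)) * np.sign(f_values)
```

For example, `truncate([-3.0, 0.5, 2.0], 1.0)` maps the three values to -1.0, 0.5 and 1.0, which coincides with `np.clip(f_values, -M, M)`.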

Noting that there is not any computation to do such 
a truncation, this truncation technique has been widely 
used to rebuild the estimator and improve the learning 
rate of boosting 0 - 0 . However, this approach has a 
drawback: the usage of the truncation operator entails
that the estimator $\pi_M f_k$ is (in general) not an element
of $\mathrm{Span}(S)$. That is, one aims to build an estimator in
a class and actually obtains an estimator outside it. This
is the reason why we do not introduce the truncation
operator in Theorem 2. Indeed, if we use the truncation
operator, then the same method as that in the proof of
Theorem 2 leads to the following Corollary 1.


Corollary 1: Let /). be the estimator obtained in Al¬ 
gorithm 0 If «/,- = 7-^77 and Assumptions Q] and [ 2 ] hold, 
then for arbitrary h € Span(.S'), there holds 


HQ^Mfk) - Q(h)} < c(\\h\\l + \ogk)k~ 1 

M) + g,(| Wl ,M)) * (l0gm m + l ° gk) . 

where C and C' are constants depending only on c \, C 2 
and C\. 

With the help of Theorem 2, we can derive the consistency
of RBoosting.

Corollary 2: Let ff. be the estimator obtained in Al¬ 
gorithm [2j If O:/. = ^ 3 , Assumptions [I] and [2] hold and 


$$k \to \infty, \quad \frac{H_\phi(\log k, M)\, k \log m}{m} \to 0, \quad \text{when } m \to \infty, \quad (6)$$

then

$$E\{Q(f_k)\} \to \inf_{f \in \mathrm{Span}(S)} Q(f), \quad \text{when } m \to \infty.$$

Corollary 2 shows that if the number of iterations
satisfies (6), then RBoosting is consistent. We should
point out that if the loss function is specified, then we
can deduce a concrete relation between $k$ and $m$ to
yield the consistency. For example, if $\phi$ is the logistic
loss, then the condition (6) becomes $k \sim m^\gamma$ with
$0 < \gamma < 1$. This condition is somewhat looser than the
previous studies concerning the consistency of boosting 
0, 0, l(22ll or its modified version 13, lf34ll . 

When applied to both classification and regression, boosting
as well as its modified versions usually exhibits an
overfitting resistance phenomenon 0, [34]. Our result
shown in Corollary 2 seems to contradict it at first
glance, as $k$ must be smaller than $m$. We illustrate that
this is not the case. It can be found in 0, [34] that,
besides Assumption 1, there is another condition such
as the covering number, VC-dimension, or Rademacher
complexity imposed on the dictionary. We highlight that
if the dictionary of RBoosting is endowed with a similar
assumption, then the condition $k < m$ can be omitted
by using methods similar to those in 0, [23], [34]. In short,
our assertions show that whether RBoosting is overfitting
resistant depends on the dictionary.

Finally, we give a learning rate analysis of RBoosting,
which is also a consequence of Theorem 2.

Corollary 3: Let /). be the estimator obtained in Al¬ 
gorithm 2. Suppose that ak = 7 - 7-7 and Assumptions 1 
and 2 hold. For arbitrary h £ Span(S'), if k satisfies 


m 


k, M) + H^QlhWi, M) ’ 


( 7 ) 
then there holds

$$E\{Q(f_k) - Q(h)\} \le C' \left( H_\phi(\log m, M) + H_\phi(\|h\|_1, M) + \|h\|_1^2 \right) m^{-1/2} \log m, \quad (8)$$

where $C$ and $C'$ are constants depending only on $c_1$, $c_2$,
$C_1$ and $M$.

The learning rate (8) together with the stopping criterion
(7) depends heavily on $\phi$. If $\phi$ is the logistic loss for
classification, then $H_\phi(\log m, M) = \log(m + 1)$ and
$H_\phi(\|h\|_1, M) = \log(\|h\|_1 + 1)$; we thus derive from
(8) that

$$E\{Q(f_k) - Q(h)\} \le C'(\log(m + 1) + \|h\|_1^2) m^{-1/2} \log m.$$

We encourage the readers to compare our result with
[34, Th. 3.2]. Without the Rademacher assumptions,
RBoosting theoretically performs at least the same as 
that of RTboosting. If cp is the loss for regression, 
we can deduce that 

E {Q(fk) - Q(h)} < C"(logm+ \\h\\j)m~ 1/2 logm, 

which is almost the same as the result in |©. If cp is the 
exponential loss for classification, by setting k ~ m 1 ' 3 , 
we can derive 

E {Q(fk) — Q(h)} < C'(log m + )m _1//3 log m, 

which is much faster than AdaBoost ©. It should be 
noted that if the truncation operator is imposed to the 
RBoosting estimator, then the learning rate of the re¬ 
scale AdaBoost can also be improved to 

E{Q(7Tjvf/fc) — Q{h)} < C"(logm + e^^)m _1//2 logm. 

IV. Numerical Results 

In this section, we conduct a series of toy simulations
and real data experiments to demonstrate the
advantage of the proposed RBoosting over the
original boosting algorithm. For comparison, three other
popular boosting-type algorithms, i.e., ε-boosting [20],
RSboosting [18] and RTboosting [34], are also considered.
In the following experiments, we utilize the L2 loss
function for regression (namely, L2Boost) and logistic
loss function for classification (namely, LogitBoost).
Furthermore, we use the CART I© (with the number of 
splits J = 4) to build up the weak learners for regression
tasks in the toy simulations and decision stumps (with
the number of splits J = 1) to build up the weak learners
for regression tasks in real data experiments and all
classification tasks.


A. Toy simulations 

We first consider numerical simulations for regression
problems. The data are drawn from the following model:

$$Y = m(X) + \sigma \cdot \varepsilon, \quad (9)$$

where $X$ is uniformly distributed on $[-2, 2]^d$ with
$d \in \{1, 10\}$, $\varepsilon$, independent of $X$, is the standard
Gaussian noise and the noise level $\sigma$ varies
in {0,0.3, 0.6,1}. Two typical regression functions HI 
are considered in the simulations. One is a univariate 
piecewise function defined by 

10 \/ — x sin(87rx), < x < 0, 

0, else, ( U) 

and the other is a multivariate continuously differentiable 
sine function defined as 

10 

7712 ( 0 ;) = ^2 sin(xj 2 ). (11) 

3 = 1 

For these regression functions and all values of $\sigma$,
we generate a training set of size 500, and then collect
an independent validation data set of size 500 to
select the parameters of each boosting algorithm: the
number of iterations $k$, the regularization parameter $\nu$
of RSboosting, the truncation value of RTboosting, the
shrinkage degree of RBoosting and ε of ε-boosting. In
all the numerical examples, we choose $\nu$ and ε from a
set of 20 points uniformly located in
$[0.01, 1]$. We select the truncation value of RTboosting
to be the same as that in [34]. To tune the shrinkage degree,
$\alpha_k = 2/(k + u)$, we employ 20 values of $u$ which are
logarithmically equally spaced between 1 and $10^6$. To
compare the performances of all the mentioned methods,
a test set of 1000 noiseless observations is used to
evaluate the performance in terms of the root mean
squared error (RMSE).
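
The tuning setup described above can be sketched in Python as follows. The grid for $u$ follows the stated range (20 logarithmically spaced values between 1 and $10^6$); treating the 20 step-size candidates in $[0.01, 1]$ as equally spaced is an assumption, as are all function names. The argument `m_func` is a placeholder for a regression function and is not one of the two functions used in the paper.

```python
import numpy as np

def shrinkage_degree(k, u):
    """Shrinkage degree alpha_k = 2 / (k + u)."""
    return 2.0 / (k + u)

def candidate_grids():
    """Candidate values for the step-size parameters and for u."""
    step_grid = np.linspace(0.01, 1.0, 20)   # assumption: equally spaced points
    u_grid = np.logspace(0.0, 6.0, 20)       # log-spaced between 1 and 10^6
    return step_grid, u_grid

def simulate(m_func, n, d, sigma, rng):
    """Draw n samples from Y = m(X) + sigma * eps with X uniform on [-2, 2]^d."""
    X = rng.uniform(-2.0, 2.0, size=(n, d))
    eps = rng.standard_normal(n)
    return X, m_func(X) + sigma * eps

def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sqrt(np.mean((pred - target) ** 2)))
```

A noiseless test set, as used for the RMSE evaluation, corresponds to calling `simulate` with `sigma = 0.0`.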

Table I documents the mean RMSE over 50 independent
runs. The standard errors are also reported (numbers
in parentheses). Several observations can be easily drawn
from Table I. Firstly, concerning the generalization capability,
all the variants essentially outperform the original
boosting algorithm. This is not a surprising result since
all the variants introduce an additional parameter. Secondly,
RBoosting performs as the almost optimal variant
since its RMSEs are the smallest or almost the smallest for
all the simulations. This means that, if we only focus 
on the generalization capability, then RBoosting is a 
preferable choice. 

In the second toy simulation, we consider the “orange
data” model which was used in [37] for binary classification.
We generate 100 data points for each class to build



TABLE I

Performance comparison of different boosting algorithms on simulated regression data examples

σ      Boosting          RSboosting        RTboosting        RBoosting         ε-boosting

Piecewise function (d = 1)
0      0.2698(0.0495)    0.2517(0.0561)    0.3107(0.0905)    0.2460(0.0605)    0.2306(0.0827)
0.3    0.6204(0.0851)    0.4635(0.0728)    0.5131(0.0735)    0.5112(0.0779)    0.4844(0.0862)
0.6    0.7339(0.0706)    0.7317(0.0392)    0.7475(0.0333)    0.7206(0.0486)    0.7403(0.0766)
1      1.1823(0.0483)    1.1474(0.0485)    1.1776(0.0604)    1.1489(0.0485)    1.1395(0.0590)

Continuously differentiable sine function (d = 10)
0      2.3393(0.1112)    1.7460(0.0973)    1.8388(0.1102)    1.6166(0.0955)    1.7434(0.0804)
0.3    2.4051(0.1112)    1.7970(0.0951)    1.8380(0.0830)    1.6732(0.0928)    1.7665(0.0718)
0.6    2.4350(0.0836)    1.8866(0.0837)    1.9628(0.0853)    1.7730(0.0832)    1.8895(0.0880)
1      2.6583(0.1103)    2.0671(0.0789)    2.1575(0.0891)    1.9870(0.1092)    2.0766(0.0956)



up the training set. Both classes have two independent
standard normal inputs $x_1$, $x_2$, but the inputs for the second
class are conditioned on $4.5 \le x_1^2 + x_2^2 \le 8$. Similarly,
to make the classification more difficult, $q$ independent
noise features were added to the inputs. One can find
more details about this data set in [37].

Table II reports the classification error of five
boosting-type algorithms over 50 independent runs.
Numbers in parentheses are the standard errors. In this
simulation, for $q$ varying among $\{0, 2, 4, 6\}$, we generate a
validation set of size 200 for tuning the parameters, and
then 4000 observations to evaluate the performances in
terms of classification error. For this classification task,
RBoosting outperforms the original boosting in terms of
the generalization error. It can also be found that as far
as the classification problem is concerned, RBoosting
is at least comparable to other variants. Here we do
not compare the performance with that of
SVMs reported in [37], because the main purpose of
our simulation is to highlight the advantage of the
proposed RBoosting over the original boosting.

All the above toy simulations from regression to 
classification verify the theoretical assertions in the last 
section and illustrate the merits of RBoosting. 

B. Real Data Examples 

In this subsection, we examine the performance of
RBoosting on eight real data sets (the first five data sets 
for regression and the others for classification). 

The first data set is the Diabetes data set [13]. This
data set contains 442 diabetes patients that are measured 
on ten independent variables, i.e., age, sex, body mass 
index etc. and one response variable, i.e., a measure 
of disease progression. The second one is the Boston 
Housing data set created from a housing values survey 
in suburbs of Boston by Harrison and Rubinfeld fl2T1 . 
This data set contains 506 instances which include 
thirteen attributes, i.e., per capita crime rate by town,


proportion of non-retail business acres per town, average 
number of rooms per dwelling etc. and one response 
variable, i.e., median value of owner-occupied homes. 
The third one is the Concrete Compressive Strength 
(CCS) data set created from [32]. The data set contains
1030 instances including eight quantitative independent 
variables, i.e., age and ingredients etc. and one dependent 
variable, i.e., quantitative concrete compressive strength. 
The fourth one is the Prostate cancer data set derived 
from a study of prostate cancer by Blake et al. Q. The 
data set consists of the medical records of 97 patients 
who were about to receive a radical prostatectomy. The 
predictors are eight clinical measures, i.e., cancer volume,
prostate weight, age etc. and one response variable,
i.e., the logarithm of prostate-specific antigen. The fifth 
one is the Abalone data set, which comes from an 
original study in [25] for predicting the age of abalone
from physical measurements. The data set contains 4177 
instances which were measured on eight independent 
variables, i.e., length, sex, height etc. and one response 
variable, i.e., the number of rings. For classification task, 
three benchmark data sets are considered, namely Spam, 
Ionosphere and WDBC, which can be obtained from 
UCI Machine Learning Repository. Spam data contains 
4601 instances, and 57 attributes. These data are used 
to measure whether an instance is considered to be 
spam. WDBC (Wisconsin Diagnostic Breast Cancer) 
data contains 569 instances, and 30 features. These data 
are used to identify whether an instance is diagnosed 
to be malignant or benign. Ionosphere data contains 
351 instances, and 34 attributes. These data are used to 
measure whether an instance was “good” or “bad”. 

For each real data set, we randomly (according to the
uniform distribution) select 50% of the data for training,
25% to build the validation set for tuning the parameters,
and the remaining 25% as the test set for evaluating the
performance of the different boosting-type algorithms.
We repeat this randomization 20 times and
report the average errors and standard errors (numbers























TABLE II

Performance comparison of different boosting algorithms on simulated “orange data” example

     Boosting       RSboosting     RTboosting     RBoosting      ε-boosting
0    11.19(1.32)%   10.36(1.16)%   10.50(1.19)%   10.44(1.12)%   10.29(1.17)%
2    11.27(1.29)%   10.48(1.24)%   10.71(1.19)%   10.59(1.25)%   10.60(1.28)%
4    11.79(1.54)%   10.79(1.21)%   11.07(1.41)%   10.90(1.24)%   10.94(1.26)%
6    12.02(1.62)%   10.93(1.21)%   11.20(1.23)%   10.91(1.28)%   11.02(1.32)%


TABLE III

Performance comparison of different boosting algorithms on real data examples

dataset      Boosting          RSboosting        RTboosting        RBoosting         ε-boosting
Diabetes     59.0371(4.1959)   55.3109(3.6591)   56.1343(3.2543)   55.6552(4.5351)   57.7947(3.3970)
Housing      4.4126(0.5311)    4.2742(0.7026)    4.3685(0.3929)    4.1752(0.3406)    4.1244(0.3322)
CCS          5.4345(0.5473)    5.2049(0.1678)    5.5826(0.1901)    5.3711(0.1807)    5.9621(0.1960)
Prostate     0.3131(0.0598)    0.1544(0.0672)    0.2450(0.0631)    0.1193(0.0360)    0.1939(0.0545)
Abalone      2.2180(0.0710)    2.1934(0.0504)    2.3633(0.0762)    2.1922(0.0574)    2.2098(0.0474)
Spam         6.06(0.60)%       5.13(0.52)%       5.24(0.48)%       5.06(0.55)%       5.02(0.51)%
Ionosphere   8.27(2.88)%       5.80(1.92)%       6.09(2.24)%       5.23(2.31)%       5.92(2.64)%
WDBC         5.31(2.11)%       2.45(1.39)%       2.69(1.58)%       2.09(1.55)%       2.52(1.33)%


in parentheses) in Table III. The parameter selection
strategies of all boosting-type algorithms are the same
as those in the toy simulations. It can be observed
that the variants outperform the original boosting
algorithm on most data sets, often by a clear margin.
Furthermore, RBoosting is always either the best or the
second best among all the algorithms. Thus, the results
on real data coincide with the toy simulations and
therefore experimentally support our theoretical
assertions. That is, all the experimental results show
that the new “re-scale” idea of RBoosting is numerically
efficient and comparable to the “regularization” idea of
the other variants of boosting. This opens a new route
to improving the performance of boosting.
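
The evaluation protocol described above (a uniform random 50%/25%/25% split, repeated 20 times, reporting a mean and a standard error) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the use of root mean squared error for the regression tasks, and the convention that the standard error is the sample standard deviation divided by the square root of the number of repetitions are ours and are not specified in the text.

```python
import math
import random
import statistics

def split_indices(n, rng):
    """Randomly split range(n) into 50% train, 25% validation, 25% test."""
    perm = list(range(n))
    rng.shuffle(perm)
    n_train = n // 2
    n_val = (n - n_train) // 2
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

def repeated_evaluation(X, y, fit_predict, n_repeats=20, seed=0):
    """Mean test RMSE and its standard error over repeated random splits.

    fit_predict(X_tr, y_tr, X_va, y_va, X_te) is a hypothetical hook for any
    boosting-type learner that tunes its parameters on the validation set
    and returns predictions for X_te.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        tr, va, te = split_indices(len(y), rng)
        pick = lambda idx: ([X[i] for i in idx], [y[i] for i in idx])
        (X_tr, y_tr), (X_va, y_va), (X_te, y_te) = pick(tr), pick(va), pick(te)
        pred = fit_predict(X_tr, y_tr, X_va, y_va, X_te)
        mse = sum((p - t) ** 2 for p, t in zip(pred, y_te)) / len(y_te)
        errors.append(math.sqrt(mse))
    return statistics.mean(errors), statistics.stdev(errors) / math.sqrt(n_repeats)

# Usage with a trivial baseline that predicts the training mean.
def mean_baseline(X_tr, y_tr, X_va, y_va, X_te):
    return [statistics.mean(y_tr)] * len(X_te)
```

Any of the compared algorithms could be plugged in through `fit_predict`; the baseline above only demonstrates the calling convention.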

V. Proof of Theorem 1

To prove Theorem 1, we need the following three
lemmas. The first one is a small generalization of
[28, Lemma 2.3]. For the sake of completeness, we give a
simple proof.

Lemma 1: Let $j_0 > 2$ be a natural number. Suppose
that three positive numbers $c_1 < c_2 \le j_0$ and $C_0$ are given.
Assume that a sequence $\{a_n\}_{n=1}^{\infty}$ has the following two
properties:

(i) For all $1 \le n \le j_0$,

$$a_n \le C_0 n^{-c_1},$$

and, for all $n \ge j_0$,

$$a_n \le a_{n-1} + C_0 (n-1)^{-c_1}.$$

(ii) If for some $v \ge j_0$ we have

$$a_v \ge C_0 v^{-c_1},$$

then

$$a_{v+1} \le a_v (1 - c_2/v).$$

Then, for all $n = 1, 2, \dots$, we have

$$a_n \le 2^{1 + \frac{c_1^2 + c_1}{c_2 - c_1}} C_0 n^{-c_1}.$$

Proof: For $1 \le v \le j_0$, the inequality

$$a_v \le C_0 v^{-c_1}$$

implies that the set

$$V = \{v : a_v > C_0 v^{-c_1}\}$$

does not contain $v = 1, 2, \dots, j_0$. We now prove that for
any segment $[n, n+k] \subset V$, there holds

$$k \le \left(2^{\frac{c_1 + 1}{c_2 - c_1}} - 1\right) n.$$

Indeed, let $n \ge j_0 + 1$ be such that $n - 1 \notin V$, which
means

$$a_{n-1} \le C_0 (n-1)^{-c_1},$$

and $[n, n+k] \subset V$, which means

$$a_{n+j} > C_0 (n+j)^{-c_1}, \quad j = 0, 1, \dots, k.$$

Then by the conditions (i) and (ii), we get

$$a_{n+k} \le a_n \prod_{v=n}^{n+k-1} (1 - c_2/v) \le \left(a_{n-1} + C_0 (n-1)^{-c_1}\right) \prod_{v=n}^{n+k-1} (1 - c_2/v).$$

Thus, we have

$$(n+k)^{-c_1} \le 2 (n-1)^{-c_1} \prod_{v=n}^{n+k-1} (1 - c_2/v),$$

where $c_2 \le j_0 \le v$. Taking logarithms and using the
inequalities

$$\ln(1-t) \le -t, \quad t \in [0,1); \qquad \sum_{v=n}^{m-1} v^{-1} \ge \int_n^m t^{-1}\,dt = \ln(m/n),$$

we can derive that

$$-c_1 \ln\frac{n+k}{n-1} \le \ln 2 + \sum_{v=n}^{n+k-1} \ln(1 - c_2/v) \le \ln 2 - \sum_{v=n}^{n+k-1} c_2/v \le \ln 2 - c_2 \ln\frac{n+k}{n}.$$

Hence,

$$(c_2 - c_1) \ln(n+k) \le \ln 2 + (c_2 - c_1) \ln n + c_1 \ln\frac{n}{n-1},$$

which implies

$$\frac{n+k}{n} \le 2^{\frac{c_1+1}{c_2-c_1}}$$

and

$$k \le \left(2^{\frac{c_1+1}{c_2-c_1}} - 1\right) n.$$

Let us take any $m \in \mathbb{N}$. If $m \notin V$, we have the desired
inequality. Assume $m \in V$ and let $[n, n+k]$ be the
maximal segment in $V$ containing $m$; then we obtain

$$a_m \le a_n \le a_{n-1} + C_0 (n-1)^{-c_1} \le 2 C_0 (n-1)^{-c_1} = 2 C_0 m^{-c_1} \left(\frac{n-1}{m}\right)^{-c_1}.$$

Since $k \le \left(2^{\frac{c_1+1}{c_2-c_1}} - 1\right) n$, we then have

$$\frac{m}{n-1} \le \frac{n+k}{n} \le 2^{\frac{c_1+1}{c_2-c_1}}.$$

This means that

$$a_m \le 2 C_0 m^{-c_1} 2^{\frac{c_1^2+c_1}{c_2-c_1}},$$

which finishes the proof of Lemma 1.

The convexity of $Q_m$ implies that for any $f, g$,

$$Q_m(g) \ge Q_m(f) + Q'_m(f, g - f),$$

or, in other words,

$$Q_m(f) - Q_m(g) \le Q'_m(f, f - g) = -Q'_m(f, g - f).$$

Based on this, we can obtain the following lemma, which
was proved in [30, Lemma 1.1].

Lemma 2: Let $Q_m$ be a Fréchet differentiable convex
function. Then the following inequality holds for $f \in D$:

$$0 \le Q_m(f + ug) - Q_m(f) - uQ'_m(f, g) \le 2\rho(Q_m, u\|g\|).$$

To aid the proof, we also need the following lemma,
which can be found in [7, Lemma 2.2].

Lemma 3: For any bounded linear functional $F$ and any
dictionary $S$, we have

$$\sup_{g \in S} F(g) = \sup_{f \in \mathcal{M}_1(S)} F(f),$$

where $\mathcal{M}_1(S) = \{f \in \mathrm{span}(S) : \|f\|_1 \le 1\}$.

Proof of Theorem 1: We divide the proof into two
steps. The first step is to deduce an upper bound of $f_k$
in the uniform metric. Since $f_{k+1} = (1 - \alpha_{k+1}) f_k + \beta^*_{k+1} g^*_{k+1}$,
we have

$$f_k = f_{k+1} + \frac{\alpha_{k+1} f_{k+1} - \beta^*_{k+1} g^*_{k+1}}{1 - \alpha_{k+1}}.$$

Noting that $Q_m(f)$ is twice differentiable, if we use the Taylor
expansion around $f_{k+1}$, then

$$Q_m(f_k) = Q_m(f_{k+1}) + Q'_m\left(f_{k+1}, \frac{\alpha_{k+1} f_{k+1} - \beta^*_{k+1} g^*_{k+1}}{1 - \alpha_{k+1}}\right) + \frac{1}{2} Q''_m\left(\hat{f}_k, \frac{\alpha_{k+1} f_{k+1} - \beta^*_{k+1} g^*_{k+1}}{1 - \alpha_{k+1}}\right)$$

$$= Q_m(f_{k+1}) + \frac{\alpha_{k+1}}{1 - \alpha_{k+1}} Q'_m(f_{k+1}, f_{k+1}) - \frac{\beta^*_{k+1}}{1 - \alpha_{k+1}} Q'_m(f_{k+1}, g^*_{k+1}) + \frac{\alpha_{k+1}^2}{2(1 - \alpha_{k+1})^2} Q''_m(\hat{f}_k, f_{k+1}) + \frac{(\beta^*_{k+1})^2}{2(1 - \alpha_{k+1})^2} Q''_m(\hat{f}_k, g^*_{k+1}),$$

where

$$\hat{f}_k = (1 - \theta) \frac{\alpha_{k+1} f_{k+1} - \beta^*_{k+1} g^*_{k+1}}{1 - \alpha_{k+1}} + \theta f_{k+1}$$

for some $\theta \in (0,1)$. By the convexity of $Q_m$, we have

$$\frac{\alpha_{k+1}^2}{2(1 - \alpha_{k+1})^2} Q''_m(\hat{f}_k, f_{k+1}) \ge 0.$$

Furthermore, if we use the fact that $f_{k+1}$ is the minimum
on the path from $(1 - \alpha_{k+1}) f_k$ along $g^*_{k+1}$, then it is easy
to see that

$$Q'_m(f_{k+1}, g^*_{k+1}) = 0.$$

According to the convexity of $Q_m$ again, we obtain

$$Q'_m(f_{k+1}, f_{k+1}) \ge Q_m(f_{k+1}) - Q_m(0).$$

Noting that = f, we obtain 

Qm(fk) A Qm(fk+1 ) + ^ (Qm (fk+l) — Qm(0)) 

ift+l ) 2 p" fi \ 

+ - 2 - Qm\fk,9k+lJ- 

If we write B = inf {Qm{f,g) : ci < Q m (f) < c 2l g € 
5}, then we have 

(Aj'+i) — ^ ^ Qm{fj ) — Qm(fj+1 ) + yQm(O)^ . 
Therefore,


E(^) 2 < 

3= 0 


2 log k 
B ' 


Then it follows from the definition of $f_k$ that

$$f_k = (1 - \alpha_k)(1 - \alpha_{k-1}) \cdots (1 - \alpha_2) \beta_1^* g_1^* + (1 - \alpha_k)(1 - \alpha_{k-1}) \cdots (1 - \alpha_3) \beta_2^* g_2^* + \cdots + (1 - \alpha_k) \beta_{k-1}^* g_{k-1}^* + \beta_k^* g_k^*.$$
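
As a sanity check on this expansion, the following sketch compares the coefficient of each $g_j^*$ obtained from the closed form with the one obtained by unrolling the recursion $f_j = (1-\alpha_j) f_{j-1} + \beta_j^* g_j^*$. It assumes $f_0 = 0$ and treats the selected atoms as distinct symbols; the function names are ours and purely illustrative.

```python
import random

def unroll(alphas, betas):
    """Coefficient of g_j^* in f_k according to the closed-form expansion.

    alphas[j] and betas[j] hold alpha_{j+1} and beta_{j+1}^* for j = 0..k-1.
    """
    k = len(betas)
    coeffs = []
    for j in range(k):
        c = betas[j]
        for i in range(j + 1, k):
            c *= 1.0 - alphas[i]   # later re-scale steps shrink earlier terms
        coeffs.append(c)
    return coeffs

def recurse(alphas, betas):
    """Same coefficients from f_j = (1 - alpha_j) f_{j-1} + beta_j^* g_j^*, f_0 = 0."""
    coeffs = []
    for a, b in zip(alphas, betas):
        coeffs = [(1.0 - a) * c for c in coeffs] + [b]
    return coeffs

rng = random.Random(0)
alphas = [rng.uniform(0.0, 0.9) for _ in range(6)]
betas = [rng.uniform(-1.0, 1.0) for _ in range(6)]
gap = max(abs(a - b) for a, b in zip(unroll(alphas, betas), recurse(alphas, betas)))
```

Note that $\alpha_1$ never appears in the coefficients, in agreement with the displayed formula, because it only multiplies $f_0 = 0$.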

Therefore, it follows from the Assumption [Tj that 


l/fcO)| < EWE^*)! 2 < yj2C!logk/B. 

\ 3=0 j =0 

( 12 ) 

Now we turn to the second step, which derives the
numerical convergence rate of RBoosting. For arbitrary
$\beta_{k+1} \in \mathbb{R}$ and $g_{k+1} \in S$, it follows from Lemma 2 that

$$Q_m((1 - \alpha_{k+1}) f_k + \beta_{k+1} g_{k+1}) = Q_m(f_k - \alpha_{k+1} f_k + \beta_{k+1} g_{k+1})$$

$$\le Q_m(f_k) - \beta_{k+1}(-Q'_m(f_k, g_{k+1})) - \alpha_{k+1} Q'_m(f_k, f_k) + 2\rho(Q_m, \|\beta_{k+1} g_{k+1} - \alpha_{k+1} f_k\|).$$

From Step 2 of the RBoosting algorithm, $g^*_{k+1}$ satisfies

$$-Q'_m(f_k, g^*_{k+1}) = \sup_{g \in S} -Q'_m(f_k, g).$$

Set $\beta_{k+1} = \|h\|_1 \alpha_{k+1}$. It follows from Lemma 3 that

$$\sup_{g \in S} -Q'_m(f_k, g) = \sup_{\phi \in \mathcal{M}_1(S)} -Q'_m(f_k, \phi) \ge -\|h\|_1^{-1} Q'_m(f_k, h).$$

Under this circumstance, we get

$$Q_m((1 - \alpha_{k+1}) f_k + \beta_{k+1} g^*_{k+1}) \le Q_m(f_k) - \alpha_{k+1}(-Q'_m(f_k, h - f_k)) + 2\rho(Q_m, \|\beta_{k+1} g^*_{k+1} - \alpha_{k+1} f_k\|).$$

Based on Lemma 2 we obtain

$$-Q'_m(f_k, h - f_k) \ge Q_m(f_k) - Q_m(h).$$

Thus,

$$Q_m(f_{k+1}) \le Q_m((1 - \alpha_{k+1}) f_k + \beta_{k+1} g^*_{k+1}) \le Q_m(f_k) - \alpha_{k+1}(Q_m(f_k) - Q_m(h)) + 2\rho(Q_m, \|\|h\|_1 \alpha_{k+1} g^*_{k+1} - \alpha_{k+1} f_k\|).$$

Furthermore, according to (12), we obtain

$$\|\|h\|_1 \alpha_{k+1} g^*_{k+1} - \alpha_{k+1} f_k\| \le \|h\|_1 \alpha_{k+1} + \alpha_{k+1} \|f_k\| \le \|h\|_1 \alpha_{k+1} + \alpha_{k+1} \|f_k\|_1 \le \left(\|h\|_1 + \sqrt{2 C_1 \log k / B}\right) \alpha_{k+1}.$$

Therefore,

$$Q_m(f_{k+1}) - Q_m(h) \le Q_m(f_k) - Q_m(h) - \alpha_{k+1}(Q_m(f_k) - Q_m(h)) + 2\rho\left(Q_m, \left(\|h\|_1 + \sqrt{2 C_1 \log k / B}\right) \alpha_{k+1}\right). \quad (13)$$
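
To see how (13) feeds into Lemma 1, it may help to insert the quadratic bound on the modulus of smoothness, $\rho(Q_m, u) \le C_2 u^2$, which the proof invokes next. The shorthands $e_k := Q_m(f_k) - Q_m(h)$ and $D_k := \|h\|_1 + \sqrt{2 C_1 \log k / B}$ are introduced here only for this illustration and are not part of the original notation.

```latex
e_{k+1} \le (1 - \alpha_{k+1})\, e_k + 2 C_2 D_k^2\, \alpha_{k+1}^2,
\qquad\text{and, whenever } e_k > 0,\qquad
e_{k+1} \le e_k \left( 1 - \alpha_{k+1} + \frac{2 C_2 D_k^2\, \alpha_{k+1}^2}{e_k} \right).
```

The second form makes the role of condition (ii) of Lemma 1 visible: as long as $e_k$ is not already small, the quadratic term is dominated by the linear decrease $\alpha_{k+1} e_k$.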

Now, we use the above inequality and Lemma 1 to prove
Theorem 1. Let $a_k = Q_m(f_{k+1}) - Q_m(h)$. Let $c_3 \in (1, 2]$
and $C_0$ be selected later. We then prove that the conditions (i)
and (ii) of Lemma 1 hold for an appropriately selected
$C_0$. Set


Cq — 2 + 


m 

2 


+ 


72 C 2 
25 


i + 


288C1C2 log k 
25B 2 


Then, it follows from (fl3l) and p(Q m ,u) A C 2 U 2 that 


. -4(0) , 9C*2 

o, 1 A —-- 1 -—— 


1 < Co, a 2 < Cq 2 1 , 


and for v > 2 , there holds 


ct v A d v —\ + Cq(u 1) 


Thus the condition (i) of Lemma Q] holds with jo = 2. 
and a v > Cqv - 1 , then by we get for v > 6 , 


0?;-|-1 A 0^(1 Q^+l 

+2C 2 (\\h\\i + V / 2 Cj^k/J3) 2 a 2 v+1 /a v ) 

/ 3 1 \ 

E a v j 1 —--|-- 1 

- V ^ + 3 2 v + 2) 

- 1 1 -1) ■ 

Thus the condition (ii) of Lemma Q] holds with C 2 = |. 
Applying Lemma 1, we obtain

$$Q_m(f_k) - Q_m(h) \le C(\|h\|_1^2 + \log k) k^{-1},$$

where $C$ is a constant depending only on $B$, $C_1$ and $C_2$.
This finishes the proof of Theorem 1. ■
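
The mechanism behind the $k^{-1}$ rate can be illustrated numerically on a simplified version of the recursion implied by (13). The sketch below is not the paper's argument: it assumes a constant $D$ in place of the slowly growing quadratic term, and the schedule $\alpha_k = 2/(k+2)$ is chosen only for illustration. Under these assumptions a short induction gives $e_k \le 4D/(k+2)$ for $k \ge 1$ whenever $e_0 \le 8D/3$, and the code checks this bound.

```python
def error_recursion(e0, D, n_steps):
    """Iterate e_{k+1} = (1 - alpha_{k+1}) e_k + D * alpha_{k+1}^2, alpha_k = 2/(k+2)."""
    errors = [e0]
    for k in range(n_steps):
        alpha = 2.0 / (k + 3)   # alpha_{k+1} = 2 / ((k + 1) + 2)
        errors.append((1.0 - alpha) * errors[-1] + D * alpha ** 2)
    return errors

errors = error_recursion(e0=1.0, D=1.0, n_steps=1000)
# Largest value of (k + 2) * e_k over k >= 1; the induction bounds it by 4 * D.
worst_ratio = max(e * (k + 2) for k, e in enumerate(errors) if k >= 1)
```

In the actual proof the quadratic term carries the factor $\log k$, which is the source of the logarithm in the final bound.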


VI. Proof of Theorem 2

To aid the proof of Theorem 2, we need the following
two technical lemmas, both of which can be found in [36].

Let $R > 0$. We denote by $B_R$ the closed ball of
$V_k = \mathrm{span}\{g_1^*, \dots, g_k^*\}$ with radius $R$ centered at the origin:

$$B_R = \{f \in V_k : \|f\| \le R\}.$$

Lemma 4: For R > 0 and 77 > 0, we have 

\ogN(B R ,ri) A C^k log ) , 

where $\mathcal{N}(B_R, \eta)$ denotes the covering number of $B_R$
with radius $\eta$ under the uniform norm.

The following ratio probability inequality is a standard
result in learning theory (see [36]).

Lemma 5: Let $\mathcal{G}$ be a set of functions on $Z$ such that,
for some $c > 0$, $|g - E(g)| \le B$ almost everywhere and
$E(g^2) \le cE(g)$ for each $g \in \mathcal{G}$. Then, for every $\varepsilon > 0$,
there holds

$$P\left\{ \sup_{g \in \mathcal{G}} \frac{E(g) - \frac{1}{m}\sum_{i=1}^{m} g(z_i)}{\sqrt{E(g) + \varepsilon}} \ge \sqrt{\varepsilon} \right\} \le \mathcal{N}(\mathcal{G}, \varepsilon) \exp\left\{ -\frac{m\varepsilon}{2c + \frac{2B}{3}} \right\}.$$
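
For orientation, in the standard form of this ratio inequality the exponent is $m\varepsilon/(2c + 2B/3)$. When the lemma is applied with $B = c$, as is done in the proof of Theorem 2, the exponent simplifies by a routine computation, spelled out here only for convenience:

```latex
\frac{m\varepsilon}{2c + \frac{2B}{3}}
  \;=\; \frac{m\varepsilon}{2c + \frac{2c}{3}}
  \;=\; \frac{m\varepsilon}{\frac{8c}{3}}
  \;=\; \frac{3m\varepsilon}{8c}
  \qquad (B = c).
```

This is where the factor $3m\varepsilon/8$ in the confidence levels of the proof comes from.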


Proof of Theorem 2: At first, we use the concentration
inequality in Lemma 5 to bound

$$Q(f_k) - Q(h) - (Q_m(f_k) - Q_m(h)).$$

We need to apply Lemma 5 to the set of functions $\mathcal{F}_R$,
where

$$\mathcal{F}_R := \{\psi(Z) = \phi(f(X), Y) - \phi(h(X), Y) : f \in B_R\}.$$

Using the obvious inequalities $\|f\|_\infty \le R$, $|y| \le M$
and $\|h\|_\infty \le \|h\|_1$, Assumption 1 yields the
inequalities


It follows from (fl2l) that f). € Br with R = C'-, log k, 
then with confidence at least 

1 - exp jc 3 fc log («)} exp {-|^} , 

there holds 

$$Q(f_k) - Q(h) - (Q_m(f_k) - Q_m(h)) \le \sqrt{\varepsilon} \sqrt{Q(f_k) - Q(h) + \varepsilon} \le \frac{1}{2}(Q(f_k) - Q(h)) + \varepsilon,$$

where $C_1 = C(h, C_5 \log k, M)$. Therefore, with the same
confidence, there holds

$$Q(f_k) - Q(h) \le 2(Q_m(f_k) - Q_m(h)) + 2\varepsilon.$$

Since Assumptions 1 and 2 hold, it follows from Theorem 1
that for any function $h \in \mathrm{span}(S)$, there holds

$$Q_m(f_k) - Q_m(h) \le C(\|h\|_1^2 + \log k) k^{-1},$$

where C is a constant depending only on ci, C 2 and C\. 
Combining the last two inequalities yields that 


\^(Z)\<H^R,M) + H^\\h\\i,M) 


T<e 


and 


Ef 2 < (H^R, M) + H^(\\h\\i,M))Eip. 


m 


i=i 


we have, for every e > 0 , 

Q(f)-Q(h)-(Q m (f)-Q m (h)) 

f£B R yfQ{f ) — Q{h ) + £ 

with confidence at least 


< yfe 


1 - exp C 3 k log ( v. exp 


3 me 


e J) f 8 C(h,R,M) 
where C(h,R,M) = (H^R, M) + H^(\\h\\i,M)). 


holds with at least 


1 — exp < C 3 k log 


Cq log k 


For ipi, ^2 € Fr, it follows from Assumption |2] that 
there exists a constant Cf such that 

\MZ)-MZ)\ = Wi,y) - <j>(f 2 ,Y)\ 

< C 4 \h(X) - f 2 (X)\. 

We then get 

N{Fr,£)<M(Br,£/C A ). 

According to Lemma |4j there holds 

log Af(F R ,e) < C 3 k log ^ • 

Employing Lemma 0 with B = c = H ( fR,. M) + 
H+QhWuM) and 

1 m 

Ef = Q(f) - Q(h), - J2 H z i ) = QmU) - Qm(h), 


x exp 


3 me 


8 C(h, C 5 log k, M) J ’ 


where 


T = 


Q(fk)-Q(h)-C(\\h\\ 2 + logk)k 


-1 


For arbitrary g > 0, there holds 

POO 

EpmfT) = / P{T > e}de 


/o 


< g + J exp log 


de 


de 


Cq log k 3 me 
1 fi y e 8C1 

. , / 3 mg\ [°° fC 6 logk' c * k 

- {- r ~ 

r 3mg\ fC 6 logk\ C:sk 

^ M+exp r^rjh^J 

By setting g = fc ) ; direct computation 

yields 


$$E(T) \le \frac{2 C_1 C_3 k (\log m + \log k)}{3m}.$$

That is,

$$E\{Q(f_k) - Q(h)\} \le C(\|h\|_1^2 + \log k) k^{-1} + \frac{4 C_1 C_3 k (\log m + \log k)}{3m},$$

which finishes the proof of Theorem 2.
VII. Conclusion and further discussions

In this paper, we proposed a new idea to overcome the
slow numerical convergence rate of boosting and
developed a new variant of boosting, named re-scale
boosting (RBoosting). Unlike other variants such as
ε-boosting, RTboosting and RSboosting, which control the
step size in the line search step, RBoosting alters the
direction of the line search by applying a re-scale
operator to the composite estimator obtained in the
previous iteration. Both theoretical and experimental
studies illustrated that RBoosting outperforms the
original boosting and is at least comparable to the other
variants of boosting. Theoretically, we proved that the
numerical convergence rate of RBoosting is almost optimal
in the sense that it cannot be essentially improved. Using
this property, we then deduced a fairly tight
generalization error bound for RBoosting, which is a new
“record” for boosting-type algorithms. Experimentally, we
showed that in a number of numerical experiments
RBoosting outperformed boosting and was at least the
second best among all variants of boosting. All these
results imply that RBoosting is a reasonable improvement
of boosting and that the “re-scale” idea provides a new
direction for improving the performance of boosting.
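
The re-scale step summarized above can be sketched for the least-squares loss over a finite dictionary. This is a minimal illustration under our own assumptions, not the implementation used in the experiments: the dictionary is a list of atoms evaluated at the sample points and is treated as symmetric (both $g$ and $-g$ are available, so the atom with the largest absolute correlation is chosen), $f_0 = 0$, and the default schedule $\alpha_k = 2/(k+2)$ is a hypothetical choice.

```python
def empirical_risk(f, y):
    """Mean squared error of fitted values f against responses y."""
    return sum((fi - yi) ** 2 for fi, yi in zip(f, y)) / len(y)

def rboost_ls(dictionary, y, n_iter, alpha=lambda k: 2.0 / (k + 2)):
    """Least-squares re-scale boosting over a finite dictionary.

    dictionary: list of atoms, each a list of the atom's values at the samples.
    Returns the fitted values of f_k at the samples after n_iter iterations.
    """
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    f = [0.0] * len(y)
    for k in range(1, n_iter + 1):
        # Gradient step: pick the atom most correlated with the residual.
        residual = [yi - fi for yi, fi in zip(y, f)]
        g = max(dictionary, key=lambda atom: abs(dot(residual, atom)))
        # Re-scale the previous estimator, then line search along g.
        a_k = alpha(k)
        shrunk = [(1.0 - a_k) * fi for fi in f]
        target = [yi - si for yi, si in zip(y, shrunk)]
        beta = dot(target, g) / dot(g, g)
        f = [si + beta * gi for si, gi in zip(shrunk, g)]
    return f

# Toy usage: three coordinate atoms and a response in their span.
atoms = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
y = [2.0, -1.0, 0.0, 0.0]
fit_rescaled = rboost_ls(atoms, y, n_iter=200)
fit_plain = rboost_ls(atoms, y, n_iter=2, alpha=lambda k: 0.0)
```

Passing `alpha=lambda k: 0.0` removes the re-scale step, so the same routine reduces to plain greedy boosting with line search.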

To stimulate further discussion of RBoosting, we close
this paper with the following two remarks.

Remark 1: Throughout the paper, apart from the theoretical
optimality, we do not provide any essential advantage
of RBoosting in applications, which makes it difficult
to persuade readers to use RBoosting rather than
other variants of boosting. We highlight that there may
be two merits of RBoosting in applications. The first
is that, owing to its good theoretical behavior,
RBoosting may outperform the other variants if its
parameters are appropriately selected. This conclusion
was partly verified in our experimental studies in
the sense that, for all the numerical examples, RBoosting
was at least the second best. The other merit is
that, compared with other variants, RBoosting offers a
totally different direction for improving the performance
of boosting. Therefore, it opens a new way to understand
and improve boosting. Furthermore, we conjecture that if
the “re-scale” idea of RBoosting and the “regularization”
idea of the other variants are combined to develop
a new boosting-type algorithm, such as re-scale
ε-boosting or re-scale RTboosting, then the performance
may be further improved. We will keep working on this issue
and report our progress in a future publication.
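
The statement that RBoosting is never worse than second best on the real data sets can be checked directly against the mean errors reported in Table III. The short script below only re-reads those published means; the helper name `rank_of` is ours, and standard errors are ignored, so it says nothing about statistical significance.

```python
# Mean test errors copied from Table III (lower is better).
METHODS = ["Boosting", "RSboosting", "RTboosting", "RBoosting", "eps-boosting"]
TABLE_III = {
    "Diabetes":   [59.0371, 55.3109, 56.1343, 55.6552, 57.7947],
    "Housing":    [4.4126, 4.2742, 4.3685, 4.1752, 4.1244],
    "CCS":        [5.4345, 5.2049, 5.5826, 5.3711, 5.9621],
    "Prostate":   [0.3131, 0.1544, 0.2450, 0.1193, 0.1939],
    "Abalone":    [2.2180, 2.1934, 2.3633, 2.1922, 2.2098],
    "Spam":       [6.06, 5.13, 5.24, 5.06, 5.02],
    "Ionosphere": [8.27, 5.80, 6.09, 5.23, 5.92],
    "WDBC":       [5.31, 2.45, 2.69, 2.09, 2.52],
}

def rank_of(method, errors):
    """1-based rank of `method` among METHODS (1 = smallest mean error)."""
    value = errors[METHODS.index(method)]
    return 1 + sum(e < value for e in errors)

ranks = {name: rank_of("RBoosting", errs) for name, errs in TABLE_III.items()}
n_best = sum(r == 1 for r in ranks.values())
```

On these means RBoosting ranks first on four data sets and second on the other four.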

Remark 2: According to the “no free lunch” philosophy,
all the variants improve the learning performance
of boosting at the cost of introducing additional
parameters, such as the truncation parameter in RTboosting,
the regularization parameter in RSboosting, ε in ε-boosting,
and the shrinkage degree in RBoosting. To facilitate the use
of these variants, one should also present strategies for
selecting such parameters. In particular, Elith et al. [14]
showed that 0.1 is a feasible choice of ε in ε-boosting;
Bühlmann and Hothorn @ recommended 0.1 for the
regularization parameter in RSboosting;
Zhang and Yu [34] proved that $O(k^{-2/3})$ is a good
value of the truncation parameter in RTboosting. One
may naturally ask: how should the shrinkage degree
$\alpha_k$ in RBoosting be selected? This is a good question,
and one that we find difficult to answer. Admittedly, in
this paper we do not give any essential suggestion for
addressing it in practice. In fact, $\alpha_k$ plays an important
role in RBoosting. If $\alpha_k$ is too small, then RBoosting
performs similarly to the original boosting, which can be
regarded as a special RBoosting with $\alpha_k = 0$. If $\alpha_k$ is
too large, the extreme case being $\alpha_k$ close to 1, then the
numerical convergence rate of RBoosting is also slow.
Although we theoretically present some values of $\alpha_k$, the
best one in applications, we think, should be selected via
some model selection strategy. We leave this important issue
into a future study PTI . where the concrete role of the 
shrinkage degree will be revealed. 
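
One simple model selection strategy of the kind alluded to in Remark 2 is hold-out validation over a small family of shrinkage schedules. The sketch below is purely illustrative and rests on our own assumptions: the family $\alpha_k = 2/(k+u)$ indexed by a hypothetical parameter $u$, a least-squares loss, a dictionary made of the (signed) input coordinates, and synthetic data. It is not a recommendation from the paper.

```python
import random

def predict(X, w):
    return [sum(wj * xj for wj, xj in zip(w, row)) for row in X]

def mse(X, y, w):
    return sum((p - t) ** 2 for p, t in zip(predict(X, w), y)) / len(y)

def rboost_coeffs(X, y, n_iter, u):
    """Least-squares RBoosting on coordinate atoms with alpha_k = 2 / (k + u)."""
    d = len(X[0])
    w = [0.0] * d
    col_norm2 = [sum(row[j] ** 2 for row in X) for j in range(d)]
    for k in range(1, n_iter + 1):
        a_k = 2.0 / (k + u)
        pred = predict(X, w)
        resid = [t - p for t, p in zip(y, pred)]
        corr = [sum(r * row[j] for r, row in zip(resid, X)) for j in range(d)]
        j = max(range(d), key=lambda i: abs(corr[i]))
        # Re-scale the current estimator, then line search along feature j.
        target = [t - (1.0 - a_k) * p for t, p in zip(y, pred)]
        beta = sum(t * row[j] for t, row in zip(target, X)) / col_norm2[j]
        w = [(1.0 - a_k) * wj for wj in w]
        w[j] += beta
    return w

def select_u(X_tr, y_tr, X_va, y_va, candidates, n_iter=50):
    """Return the candidate u with the smallest validation error, and all scores."""
    scores = {u: mse(X_va, y_va, rboost_coeffs(X_tr, y_tr, n_iter, u)) for u in candidates}
    return min(scores, key=scores.get), scores

# Synthetic usage: a sparse linear signal with Gaussian noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * r[0] - 2 * r[1] + rng.gauss(0, 0.5) for r in X]
best_u, scores = select_u(X[:100], y[:100], X[100:], y[100:], candidates=[2, 10, 100, 1000])
```

Larger values of `u` make $\alpha_k$ smaller and move the procedure towards the original boosting, so the candidate grid spans both regimes discussed in the remark.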